Exploratory Data Analysis

We separate the labels from the rest of the dataset, obtaining X with only the features and y with only the labels.
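As a sketch of this step (the DataFrame and the column name "label" are assumptions for illustration, not taken from the original data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; "label" is an assumed column name.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1300, 35)),
                  columns=[f"f{i}" for i in range(35)])
df["label"] = rng.integers(0, 2, size=1300)

X = df.drop(columns=["label"])  # features only
y = df["label"]                 # labels only
```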

We see that our dataset is composed of 1300 rows (samples) and 35 columns (features).

Now we scale the dataset so that all features are on the same scale. Later we will check whether this step improves the classifiers' performance.
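A minimal sketch of the scaling step, assuming scikit-learn's StandardScaler is the scaler used:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in features; in the notebook this would be the real feature matrix X.
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(1300, 35))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance
```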

Data Visualization

Next we explore and visualize the features.

The features appear to be approximately normally distributed.

We will now perform a 2D plotting of the features to see if we can distinguish between the labels.

From the plots above we can deduce that our data roughly follows a normal distribution centered at 0; indeed, as checked earlier, the mean of the data is around 0. The plots also confirm that the scaling applied earlier is needed.

From this plot we can deduce that many pairs of features are almost perfectly uncorrelated.

Dimensionality reduction

We try using dimensionality reduction through PCA to see if we can distinguish the two labels.

These graphs show the variance explained by the most important principal components.

Finding an elbow in the cumulative variance graph is not easy: after the first three components, the cumulative variance increases at a roughly constant rate.
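The explained-variance curves described above can be computed like this (a sketch on stand-in data; the real notebook would use the scaled feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1300, 35))  # stand-in for the scaled features

pca = PCA().fit(X)
explained = pca.explained_variance_ratio_  # per-component share of variance
cumulative = np.cumsum(explained)          # cumulative variance curve to plot
```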

Now we build an interactive 3D plot to explore the components further.

The plots shown above, taking the first two and three principal components, do not reveal clear clusters in the data. Therefore we try applying t-SNE.

From the plots shown above we can roughly make out two groups, but we cannot deduce anything conclusive about the division of the data. In the next section we will build two classifiers, a Linear SVM and a Random Forest, to try to predict the labels.
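The t-SNE embedding used for these plots can be sketched as follows (a small stand-in sample keeps the run fast; the perplexity value is an assumption):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 35))  # small stand-in sample of the features

# Project to 2D for plotting; PCA initialization tends to be more stable.
X_embedded = TSNE(n_components=2, init="pca", perplexity=30,
                  random_state=0).fit_transform(X)
```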

Classifiers

SVM

Now we implement SVM on the two datasets, trying to predict the labels. First we try a linear SVM using LinearSVC.

Original dataset

Scaled dataset

The accuracies are approximately equal on both datasets (about 0.81). Next, we tune the parameters with GridSearch on the original dataset to try to increase the accuracy.
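The comparison above can be sketched like this (synthetic stand-in data of the same shape; the split and solver settings are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in with the same shape as our dataset.
X, y = make_classification(n_samples=1300, n_features=35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# LinearSVC on the raw features
acc_raw = LinearSVC(dual=False).fit(X_tr, y_tr).score(X_te, y_te)

# LinearSVC on the scaled features
acc_scaled = make_pipeline(
    StandardScaler(), LinearSVC(dual=False)
).fit(X_tr, y_tr).score(X_te, y_te)
```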

Hyperparameters' Tuning

We get a slight increase in accuracy after running GridSearch. Next we print the classification report and plot the confusion matrix and the Receiver Operating Characteristic (ROC) curve to learn more about the precision of this model.
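A sketch of the tuning step (the parameter grid and cross-validation settings are assumptions, not the values used in the original run):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1300, n_features=35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Exhaustively search a small grid of regularization strengths.
grid = GridSearchCV(LinearSVC(dual=False),
                    param_grid={"C": [0.01, 0.1, 1, 10]},
                    cv=5).fit(X_tr, y_tr)
best_acc = grid.score(X_te, y_te)  # test accuracy of the best estimator
```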

Precision: The ability of a classifier to avoid labeling a negative sample as positive. If precision is high, that means the classifier does a good job of not misclassifying negative instances.

Recall: The ability of a classifier to find all the positive instances. If recall is high, that means the classifier does a good job of finding positive instances.

F1-score: The harmonic mean of precision and recall. It tries to balance these two values. A high F1 score means that both precision and recall are high.

Support: Support is the number of actual occurrences of the class in the specified dataset.
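These definitions can be checked on a tiny hand-made example:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 0]  # 3 true positives, 1 false positive, 1 false negative

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 0.75
```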

In our case the high values suggest that the model makes correct predictions most of the time.

The confusion matrix indicates that we accurately predicted the majority of instances. The ROC curve plots the true positive rate against the false positive rate at different threshold settings. The area under the ROC curve summarizes the model's performance, with a higher value indicating a better model, which is the case here.

SVM with Kernels

Now, to try to get an even higher accuracy, we implement SVM with different kernels.
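The kernel comparison can be sketched as below (stand-in data; which kernels were actually compared is an assumption based on scikit-learn's standard options):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1300, n_features=35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit one SVC per kernel and record the test accuracy of each.
accuracies = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    accuracies[kernel] = SVC(kernel=kernel).fit(X_tr, y_tr).score(X_te, y_te)
```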

Hyperparameters' Tuning

This already improves the accuracy by 0.13 percentage points! How does this change when we apply Randomized Grid Search with the scaled dataset?

Next we use the best estimators found after some tries of Randomized Search.
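A sketch of the randomized search over the rbf-kernel SVC on scaled data (the sampling distributions and iteration count are assumptions):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1300, n_features=35, random_state=0)
X = StandardScaler().fit_transform(X)  # scaled dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Sample C and gamma log-uniformly instead of exhausting a fixed grid.
search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions={"C": loguniform(1e-2, 1e2),
                         "gamma": loguniform(1e-4, 1e0)},
    n_iter=10, cv=3, random_state=0,
).fit(X_tr, y_tr)

best_acc = search.score(X_te, y_te)
```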

The classification report, confusion matrix and the ROC curve significantly improved. Therefore, we can deduce that this last model worked better.

With Grid Search and Randomized Grid Search, we chose a number of iterations that is neither too small nor too large, in order to optimize the most important parameters while keeping the running time in check. The goal was to achieve the best possible performance in the least amount of time.

Random Forest

Now we implement a random forest classifier.

Original dataset

Scaled dataset

The accuracies are all equal. Now we apply GridSearch to the original dataset to see if we can get a higher accuracy.
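Equal accuracies are expected here: tree-based models split on feature thresholds, so a monotonic rescaling of the features leaves the splits, and hence the predictions, essentially unchanged. A sketch on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1300, n_features=35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random forest on the raw features
rf = RandomForestClassifier(n_estimators=100, random_state=0)
acc_raw = rf.fit(X_tr, y_tr).score(X_te, y_te)

# Same forest on the scaled features: accuracy is essentially unchanged
scaler = StandardScaler().fit(X_tr)
rf2 = RandomForestClassifier(n_estimators=100, random_state=0)
acc_scaled = rf2.fit(scaler.transform(X_tr), y_tr).score(
    scaler.transform(X_te), y_te)
```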

Hyperparameters' Tuning

After many tries, this approach does not produce a higher accuracy. Consequently, we try Randomized Grid Search instead.

Next we use the best estimators found after some tries of Randomized Search.
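The randomized search over the forest's hyperparameters can be sketched as below (the sampled ranges, the smaller stand-in dataset, and the iteration count are assumptions chosen to keep the run short):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Sample hyperparameter combinations instead of exhausting a grid.
param_dist = {
    "n_estimators": randint(100, 400),
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
    "bootstrap": [True, False],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3,
                            random_state=0).fit(X_tr, y_tr)
best_acc = search.score(X_te, y_te)
```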

We get an increase in accuracy, from approximately 87% to 88%. From the classification report and the confusion matrix we can deduce that this tuning of the parameters works better than GridSearch.

Next, to better understand the choice of the n_estimators, we plot the accuracy against the number of estimators.
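One way to run such a sweep (a sketch on stand-in data; the grid of candidate values is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Test accuracy for each candidate forest size; plot accs against n_grid
# to look for the point where the curve plateaus.
n_grid = [50, 100, 200, 400]
accs = [RandomForestClassifier(n_estimators=n, random_state=0)
        .fit(X_tr, y_tr).score(X_te, y_te) for n in n_grid]
```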

This suggests that the optimal number of estimators is around 500, close to what we obtained after a few runs of Randomized GridSearch. This number of estimators, paired with the other tuned parameters, turns out to be a good choice and leads to a solid increase in accuracy.

Here follows the plot of a tree from the random forest.

We can then plot the features in descending order of importance. Given the distribution of our data we expect small values.
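The importance ranking can be obtained like this (a sketch on stand-in data; scikit-learn's impurity-based importances are normalized to sum to 1, which is why individually small values are expected with this many features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=35, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = rf.feature_importances_  # impurity-based, sums to 1
order = np.argsort(importances)[::-1]  # feature indices, most important first
```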

Predictions

Conclusion

We began our project by taking a closer look at our dataset. There were no missing values and we saw that the label column is balanced between the two classes. We looked at the distribution of the feature columns as well as at the statistical summary of our dataset. Then we visualized the data to see if there were any clear distinctions between classes. Moreover, the dimensionality reductions we tried didn't find any significant differences between the labels.

At this point, we started implementing Support Vector Machine (SVM) and Random Forest classifiers. Initially, we leaned towards the Random Forest classifier because it achieved an accuracy of approximately 88%, outperforming the Linear SVM's 82%. The parameters we settled on for this model were RandomForestClassifier(n_estimators=389, min_samples_split=2, min_samples_leaf=1, bootstrap=False).

However, upon deeper exploration and optimization of the SVM classifier, we achieved a significantly improved outcome. Using a combination of Randomized Grid Search and the radial basis function (rbf) kernel with the scaled dataset, we managed to attain an impressive accuracy rate of around 96%. As a result, we decided to choose this classifier for our predictions. The selected parameters were svm_opt = SVC(C=1.592282793341094, gamma=0.08111308307896872, kernel='rbf', random_state=45), with the remaining parameters kept at their default values.

We should note that we did not anticipate perfect performance, as, from the beginning, distinguishing clusters based on labels was difficult. The tuning of hyperparameters played a critical role in boosting our model's performance. Specifically, the Randomized Grid Search helped us achieve a notable increase in accuracy.

While searching for the most suitable parameters for our dataset, we also took the models' computational cost into consideration and selected the parameters that appeared to contribute most to the increase in performance.